[57] ABSTRACT

A method for recognizing text in graphical drawings includes
creating a binarized representation of a drawing to form an
electronic image of pixels. Text regions are discriminated from
lines in the image by grouping pixels into blocks and comparing
the blocks with a predetermined format to identify text regions.
The lines that remain in the text regions are removed to create
text only regions. The text is recognized in the text only regions.

20 Claims, 4 Drawing Sheets 
METHOD OF SEARCHING AND EXTRACTING TEXT INFORMATION FROM DRAWINGS

BACKGROUND

1. Field of the Invention

This disclosure relates to a method for extracting text information
and, more particularly, to a method for searching documents and
drawings for text characters mixed with graphics.

2. Description of the Related Art

Effective and automated creation of hypermedia has recently
received immense focus due to the high demand for hypermedia
applications generated as a result of the huge and growing
popularity of the World Wide Web (WWW). Unfortunately, the creation
of hypermedia to date continues to be a laborious, manually
intensive job, in particular the task of referencing content in
drawing images to other media. In a majority of cases, the
hypermedia authors have to locate anchorable information units
(AIUs) or hotspots (areas or keywords of particular significance)
which are then appropriately hyperlinked to relevant information.
In an electronic document the user can retrieve associated detailed
information by mouse clicking on these hotspots as the system
interprets the associated hyperlinks and fetches the corresponding
information.

Extraction of AIUs is of enormous importance for the generation of
hypermedia documents. However, achieving this goal is nontrivial
from raster images. This is particularly true in the case of
scanned-in images of engineering documents, which primarily consist
of line drawings of mechanical parts with small runs of text
indicating the part or group number. Scanned-in mechanical drawings
have machine parts that are labeled by text strings. These text
strings point to the relevant machine parts. One way to create an
index for the machine parts would be to point on the associated
text. Obviously, the areas of interest to the hypermedia author and
to the end user are those text strings that identify the part
numbers or other related document information. This is also
important within the scope of making drawings more content
referable in electronic documents.

What makes this problem challenging is the indistinguishability of
text from polylines which constitute the underlying line drawings.
This also partially explains the paucity of reliable products that
can undertake the above-mentioned task. While developing a general
fit-for-all algorithm that would work for all kinds of line-drawing
images is almost impossible, solutions can be achieved by making
use of underlying structures of the concerned documents.

Currently, most available methods cannot be used reliably for
drawing images. They can primarily be categorized as follows:
(a) raster-to-vector converters and (b) traditional OCR methods
mainly used for text documents. Due to the similarity between text
and polylines, extraction of text from line drawings is a very
difficult task. While the raster-to-vector converters treat the
whole image as consisting of line drawings only, the OCR software
packages presume the whole image is text. In the first case text is
converted to line drawings; in the second case line drawings are
attempted to be interpreted as text. While the first category of
products is clearly irrelevant within the present context, the
second category leaves the task of culling out the relevant
material from all the "junk" that it produces as a result of
misreading line drawings as text.

Several prior art software packages fall within this category.
While they can accomplish some preprocessing, such as image
despeckling and enhancement, proper interpretation of text within
the context of line drawing images requires the user to manually
outline the text regions, which is tedious and time consuming.

Therefore, a need exists for an automated method to locate keywords
in engineering drawings and documents to create proper AIUs for
cross-referencing in hypermedia documents. Most of the suggested
prior art methods do not optimally use the underlying geometry and
domain-specific knowledge to achieve the task of text separation.
It is desirable for the above-mentioned method to make use of the
geometry and length of the text strings that are to be identified
in order to localize them. These localized regions are then
analyzed using OCR software to extract the exact text content.
Further, the method must be amenable to user manipulation and input
to adapt to the variability of the class of documents under
consideration. The friendly user interface should also allow
corrections at different stages of the procedure.

SUMMARY OF THE INVENTION

A method for recognizing text in graphical drawings includes
creating a binarized representation of a drawing to form an
electronic image of pixels. Text and lines in the image are
discriminated by grouping pixels into blocks and comparing the
blocks with a predetermined format to identify possible text
regions. The lines that remain in the text regions are removed to
create text only regions. The text is recognized within the text
only regions. The removal of lines may be performed in stages. For
example, a line removal may be performed prior to the
discrimination of text and lines and a second line removal can be
performed after potential text regions are identified.

In particularly preferred embodiments the step of optically
scanning the drawing is included. Creating a binarized
representation may include the steps of comparing pixel values to a
grey-scale threshold value and assigning the pixels to be either
white or black. Discriminating between text regions and graphical
lines may include the steps of determining a distance between each
pixel and nearest neighboring pixels, comparing each distance to a
predetermined distance, assigning labels to pixels within the
predetermined distance to form pixel blocks having the same label,
and comparing pixel blocks to predetermined text formats for
identifying text regions. Once text regions are identified,
graphical lines may be removed by subdividing the drawing space
into a grid defining grid spaces, counting black pixels associated
with text and graphics lines in each grid space, comparing a count
of black pixels to a predetermined count for determining if the
grid space contains a line and removing lines by changing black
pixels to white pixels.

Alternate embodiments include the step of recognizing text in the
text only regions by using optical character recognition software.
The method may further include the step of cleaning up the image by
median filtering and the step of creating anchorable information
units for hypertext creation.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described in detail in the following
description of preferred embodiments with reference to the
following figures wherein:

FIG. 1 is a flow chart showing basic operations of an AIU
extractor;

FIG. 2 is a flow chart showing the steps for finding and extracting
text from a graphical drawing as in step 12 of the flow chart in
FIG. 1;
FIG. 3 is a flow chart showing substeps for performing step 100 of
FIG. 2;

FIG. 4 is a flow chart showing substeps for performing step 200 of
FIG. 2;

FIG. 5 is a flow chart showing substeps for performing step 300 of
FIG. 2; and

FIG. 6 is a flow chart showing the steps for text recognition as in
step 14 of FIG. 1.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present disclosure describes an automated method to locate
keywords in engineering drawings to create proper anchorable
information units (AIUs) or hotspots (areas or keywords of
particular significance) which are then appropriately hyperlinked
to relevant information. The method provides an integrated image
processing and OCR technique for a superior handling of mixed
text/graphic (MTG) images of documents or engineering drawings.
Image processing techniques are used to first ready the image for
further processing by binarizing it and removing all probable
lines, creating a binarized line-removed (BLR) image. Mixed
Text/Graphic (MTG) regions are then extracted from raster images of
engineering drawings, which are then filtered to extract the
Filtered Text (FT) Regions. Existing OCR technology is used only on
these areas. This avoids blindly applying OCR to non-textual areas
and thereby increases efficiency. The method can precisely identify
the proper areas (FT Regions) to apply OCR technology in this
drawing AIU extracting process. This is crucial for successful
extraction of text runs identifying part numbers and other relevant
information from raster images of engineering drawings. The
existing products do not presently have such a capability. This
method has a systematic approach that automatically extracts the
relevant text from images of paper drawings. The method further
allows user input to aid in tailoring the results.

Referring now in specific detail to the drawings in which like
reference numerals identify similar or identical elements
throughout the several views, and initially to FIG. 1, a flow chart
shows the main steps in a drawing image AIU extractor. As discussed
above, the input to the system is scanned images of engineering
drawings. A first step 12 locates text boundaries and a second step
14 uses this input to extract the actual text content. Based on the
appearance and underlying geometry and structure, first step 12
identifies these text areas. Text areas are identified by first
extracting the mixed text/graphic (MTG) regions from raster images
of engineering drawings, which are then filtered to extract the
filtered text (FT) regions. See hereinafter for more details. In
second step 14 an optical character recognition (OCR) toolkit is
used to read the text content within these regions and the final
output is the AIUs which are then sent to a hyperlinker for further
processing. Each of steps 12 and 14 is interactive in the sense
that should a computer make any mistakes, the user can go in and
correct them.

Referring now to FIG. 2, step 100 involves converting an image of
an engineering drawing or document to binary format and cleaning up
the resulting image for further processing. The input to the system
is typically a raw raster image and the desired output is generally
a Binarized Line-Removed (BLR) Image. Engineering drawings may be
optically scanned to create a binarized image or the drawing may
exist as an electronic file on a computer. FIG. 2 is associated
with step 12, text region finder, of FIG. 1, and FIG. 6 is
associated with step 14, text recognizer.

FIG. 3 shows more detail for step 100. For most engineering
drawings or documents, the images are either grey-level or binary.
In the case where they are binary, step 102 becomes redundant.
Step 102 is still incorporated within the procedure because
drawings are often not binary, or it is desired to operate on a
smaller grey level image. Since grey level images are by definition
multilevel, a smaller image can easily depict as much information
as can be depicted within a bigger binary image. Converting from
grey level to binary primarily involves thresholding. Thresholding
includes assigning a grey-level value threshold as a reference to
be compared to each individual pixel in a two dimensional document
space x by y. The level is so selected that the background
continues to be white and all the other pixels belonging to either
the text or the line drawings appear as black. A step 110 compares
the grey-level value of each pixel. If it is below the threshold
value, it is assigned to be a black pixel in a step 108. If it is
above the threshold value, it is assigned to be a white pixel in a
step 106. This comparison is repeated for all pixels.

A step 104 is an optional step and is carried out only in a certain
class of images. Step 104 is implemented only when the text and the
line drawings are intermingled all over the document image. If
there is a relative degree of separation that can easily be judged
by the user, step 104 is omitted. Step 104 removes the lines from
the drawings. This is done through the use of a Hough Transform,
which is described in U.S. Pat. No. 3,069,654 issued to
P. V. C. Hough and is incorporated herein by reference. In
parametric form, the equation of a line can be given as
$x\cos\theta + y\sin\theta = \rho$. First, the parameter space
$(\rho, \theta)$ is subdivided into a grid of cells. Every black
pixel in the image is associated with a cell which has
corresponding values of $\rho$ and $\theta$. Once all the black
pixels are accounted for, a membership count for each cell in the
$(\rho, \theta)$ plane is taken. Those cells which have a high
count are considered as possible lines. After obtaining the
equations of possible lines in a step 112, the image plane is
revisited to see if indeed there is a line satisfying the equation
given by the $(\rho, \theta)$ value passing through a particular
black pixel in a step 116. In a step 114 and a step 118, the black
pixel is replaced by a white one if it is found that the line is of
a certain minimum length within a neighborhood of a particular
black pixel. This cross-checking is necessary to rule out false
alarms that might arise out of alignment of text strings or other
non-line structures.

Referring back to FIG. 2, a step 200 includes image classification
and grouping. Step 200 involves the use of a multi-step algorithm
including smearing algorithms described in G. Nagy, S. Seth, and
M. Viswanathan, "A prototype document image analysis system for
technical journals", IEEE Computer, 25:10-22, 1992. However,
modifications to this algorithm need to be made for the type of
documents handled. Text strings have a certain height that depends
on the font size used. The spacing between black pixels, which is
related again to the font size and style, can in general be
specified to satisfy a certain upper bound. Step 200 consists of
the substeps described in FIG. 4.

Referring to FIG. 4, a step 202 uses the smearing algorithm. This
operation is primarily engaged in the horizontal direction. A step
208 determines the distance between two nearest black pixels. For
drawing images, where it is possible to have vertical lines of
text, step 208 is performed in both horizontal and vertical
directions. Within an arbitrary string of black and white pixels,
white pixels are replaced by black pixels if the number of adjacent
white pixels between two black pixels is less than a predetermined
constant in a
step 210 and a step 212. This constant is clearly related to the
font size and can be user-defined. The effect that step 202 has is
that it closes all the gaps that exist between different letters in
a word and reduces a word to a rectangular block of black pixels.
However, it also affects the line drawings in a similar fashion.
The difference here is that by the very nature of their appearance,
text words after step 202 look rectangular of a certain height (for
horizontal text) and width (assuming that the part numbers that
appear in an engineering drawing are likely to be of a certain
length). Fortunately, the line drawings generate irregular patterns
making them discernible from associated text.

Since the line removal step 104 is likely to miss some lines, a
step 204 can clean up the image after executing step 202 by running
a median filtering operation. A median filtering for a vector g of
size N (where the vector in this case consists of the image pixel
and its neighbors) is given by:

$\mathrm{median}(g) = R_{[N/2]}(g)$

where $R_{[N/2]}$ is the $N/2$ order statistic of $g$. Depending on
the size of the neighborhood chosen, step 204 removes small islands
of black pixels which are not likely to have been detected due to
the presence of an underlying text.

A step 206 is initiated after completion of step 204. A step 216
labels the black pixels in order to be able to identify each block
separately. A modified version of a standard technique is used
whereby as the first line is scanned from left to right, a label 1
is assigned to a first black pixel. This label is propagated
repeatedly in a step 218. A step 222 assigns the smallest label
amongst neighboring black pixels to subsequent adjacent black
pixels, i.e. subsequent black pixels are labeled identically. This
propagation stops in a step 220 when a first white pixel is
encountered. The next black pixel is labeled 2 and is similarly
propagated. This continues till the end of the first line is
reached. For each black pixel on the second and succeeding lines,
the neighborhood in the previously labeled lines along with the
left neighborhood (on the same line) are examined and the pixel
gets assigned the lowest label among them. If there are no labeled
neighbors, the pixel gets a new label that has not yet been used.
This procedure is continued until all the pixels in the image have
been examined.

Even after step 218 has been completed, it is likely to have some
adjacent black pixels that are labeled differently. So an iterative
step 224 is initiated through the image to fuse such labels. The
iteration is ended when a pass through the image results in no more
label changes. At this point, a number of labeled regions exist.
However, they might have originated as a result of either text or
line drawings and are thus called Mixed Text/Graphics (MTG)
Regions.

Referring again to FIG. 2, a step 300 endeavors to filter out the
text regions from the graphic regions. These regions are thus named
as Filtered Text (FT) Regions.

Referring now to FIG. 5, a step 302 can be partly combined with the
previous step. Step 302 involves finding the text boundaries, i.e.
the procedure of text discrimination. Step 302 needs to first
calculate the following properties of the MTG Regions. Step 302
involves the calculation of a bounding box around the different MTG
Regions to filter out the text regions. A table is created that
stores the coordinates of the bounding box and the number of black
pixels within each box, i.e. the number of pixels belonging to that
label. The main objective is to extract the bounding box. To that
end, for each label assignment to a pixel (x, y), the following is
computed:

$x_{\min} = \min(x, x_{\min})$

$x_{\max} = \max(x, x_{\max})$

$y_{\min} = \min(y, y_{\min})$

$y_{\max} = \max(y, y_{\max})$

Once the MTG Region boundary has been computed, we calculate the
number of black pixels in each label by inspecting the area
included within each bounding box.

After the region boundaries have been computed, a further step is
included to compute the relevant parameters of each block:

1. The width and height of each block:

$\mathrm{width} = x_{\max} - x_{\min}$

$\mathrm{height} = y_{\max} - y_{\min}$

2. The aspect ratio of each block:

$a = \mathrm{width} / \mathrm{height}$

3. The area of each block:

$A = \mathrm{width} \times \mathrm{height}$

4. The ratio of the number of black pixels to the surrounding area:

$R = (\text{number of black pixels}) / A$

Text discrimination is then carried out in a step 306 as described
herein. A step 308 tests the label or region parameters against
predetermined values in order to identify text blocks. Regions that
are too small or too large are eliminated in a step 310, i.e. only
regions that fall within a certain minimum and maximum area are
accepted:

$A_{\min} \le A \le A_{\max}$

This removes a large number of false alarms. Usually part numbers
can only be of a certain number of characters. Regions that have
unexpectedly large or small width and height are also eliminated in
step 310. The elimination criteria are:

$\mathrm{width}_{\min} \le \mathrm{width} \le \mathrm{width}_{\max}$

$\mathrm{height}_{\min} \le \mathrm{height} \le \mathrm{height}_{\max}$

Since text regions are desired, regions that appear square shaped
can typically be eliminated. Rectangular regions are focused upon.
This creates the following criteria: for horizontal text
$a \ge a_{\min}$ and for vertical text $1/a \ge a_{\min}$.

Regions that are relatively empty are also eliminated in a step 312
and a step 310, i.e. regions whose black pixels are arranged in a
non-rectangular way. This is characteristic of line drawings, which
are unlikely to be text. The criterion for this is:

$R \ge R_{\min}$

The limits in the above are domain dependent and the user has the
ability to choose and modify them based on the characteristics of
the document processed. The regions that satisfy all the above are
declared as Filtered Text (FT) Regions. Before they are processed
further the user may have the option to correct for mislabeled text
and also outline missed text.
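
The grouping and filtering stages described above (thresholding,
smearing in step 202, labeling in step 206, and the region tests of
steps 302 to 312) can be illustrated with a minimal Python sketch.
It is not the patented implementation. The function names, the
dictionary of limits and the image encoding (1 for black, 0 for
white) are assumptions made for illustration. A flood fill is
substituted for the raster-scan label propagation and fusion of
steps 216 to 224; both yield connected regions. Width and height
are counted inclusively (max - min + 1), an assumption that avoids
a zero height for one-pixel-tall regions.

```python
from collections import deque


def binarize(gray, threshold):
    """Pixels darker than the threshold become black (1), others white (0)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def smear_rows(img, max_gap):
    """Fill white runs shorter than max_gap that lie between two black pixels."""
    out = [row[:] for row in img]
    for src, dst in zip(img, out):
        last_black = None
        for x, v in enumerate(src):
            if v:
                if last_black is not None and x - last_black - 1 < max_gap:
                    for k in range(last_black + 1, x):
                        dst[k] = 1
                last_black = x
    return out


def label_regions(img):
    """Return {label: (xmin, ymin, xmax, ymax, black_count)} per 8-connected region."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    regions, label = {}, 0
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            label += 1
            xmin = xmax = x0
            ymin = ymax = y0
            count = 0
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            while queue:
                x, y = queue.popleft()
                count += 1
                # Running bounding-box update, as in the min/max rules above.
                xmin, xmax = min(x, xmin), max(x, xmax)
                ymin, ymax = min(y, ymin), max(y, ymax)
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        if img[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            regions[label] = (xmin, ymin, xmax, ymax, count)
    return regions


def filter_text_regions(regions, lim):
    """Keep regions whose area, size, aspect ratio and density pass the limits."""
    kept = {}
    for label, (xmin, ymin, xmax, ymax, count) in regions.items():
        width = xmax - xmin + 1   # inclusive extent (assumption, see lead-in)
        height = ymax - ymin + 1
        area = width * height
        a = width / height        # aspect ratio
        r = count / area          # black-pixel density
        ok = (lim["area"][0] <= area <= lim["area"][1]
              and lim["width"][0] <= width <= lim["width"][1]
              and lim["height"][0] <= height <= lim["height"][1]
              and max(a, 1 / a) >= lim["a_min"]   # horizontal or vertical text
              and r >= lim["r_min"])
        if ok:
            kept[label] = (xmin, ymin, xmax, ymax)
    return kept
```

A caller would chain the functions, for example
filter_text_regions(label_regions(smear_rows(binarize(gray, 128), 3)), limits),
with the limits chosen per document class as the description notes.
The sketch applies the same width and height limits to horizontal
and vertical text; a fuller version would likely swap them for
vertical strings.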

Referring to FIG. 6, after the plausible text areas have been
identified, a step 400 is to use an optical character recognition
(OCR) toolkit to identify the ASCII text from the FT Regions. A
step 402 basically necessitates getting the image data ready for
the OCR software. To speed up the process, the previous text
boundary extraction is carried out on a smaller scaled down version
of the image. However, to get satisfactory results, we need the OCR
to operate on the actual version of the scanned image. Thus the FT
Regions extracted previously need to be scaled appropriately. The
FT Region boundaries are also slightly enlarged to assure proper
focus for the OCR. Also the user may have the capability at this
stage to define other parameters that might affect the performance
of the OCR.
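
The rescaling and slight enlargement of FT Region boundaries in
step 402 might look like the following Python sketch. The function
name, the inclusive (xmin, ymin, xmax, ymax) box convention, and
the scale and pad parameters are assumptions for illustration; the
description does not specify how much the boundaries are enlarged.

```python
import math


def scale_and_pad(box, scale, pad, full_width, full_height):
    """Map an inclusive box from the scaled-down image to the full-size image.

    scale is full-size pixels per scaled-down pixel; pad is the number of
    full-size pixels added on every side. The result is clamped to the image.
    """
    xmin, ymin, xmax, ymax = box
    x0 = math.floor(xmin * scale) - pad
    y0 = math.floor(ymin * scale) - pad
    # (xmax + 1) * scale is the exclusive right edge; subtract 1 to stay inclusive.
    x1 = math.ceil((xmax + 1) * scale) - 1 + pad
    y1 = math.ceil((ymax + 1) * scale) - 1 + pad
    return (max(0, x0), max(0, y0),
            min(full_width - 1, x1), min(full_height - 1, y1))
```

Clamping keeps an enlarged box inside the scanned image, so a
region near the border can still be cropped and passed to the OCR.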

A step 404 corrects any systematic error that might occur in the
OCR. If the user is aware of the structure of the part numbers that
are to be recovered, then the structure can be specified, for
example using regular expressions, and the output of the OCR can be
checked to see if it indeed satisfies such structures. Otherwise
the user can rectify the errors in a step 406. Once the OCR is
applied and the results verified, the underlying extracted text is
associated to each one of the blocks, completing the creation of
the AIUs.

Having described preferred embodiments of a novel method for
searching and extracting text information from drawings (which are
intended to be illustrative and not limiting), it is noted that
modifications and variations can be made by persons skilled in the
art in light of the above teachings. It is therefore to be
understood that changes may be made in the particular embodiments
of the invention disclosed which are within the scope and spirit of
the invention as defined by the appended claims.
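
The structure check of step 404 can be illustrated with a short
Python sketch using the standard re module. The part-number format
(two capital letters, a hyphen, four digits) and the table of
character confusions are hypothetical; the description only states
that the user may specify the expected structure, for example with
regular expressions.

```python
import re

# Hypothetical part-number format: two capital letters, hyphen, four digits.
PART_NUMBER = re.compile(r"[A-Z]{2}-\d{4}")

# Hypothetical systematic OCR confusions in the numeric part.
DIGIT_FIXES = str.maketrans({"O": "0", "I": "1", "l": "1", "S": "5"})


def check_ocr_output(text, pattern=PART_NUMBER):
    """Return (value, status) with status 'ok', 'corrected' or 'needs_review'."""
    candidate = text.strip()
    if pattern.fullmatch(candidate):
        return candidate, "ok"
    # Apply the confusion table only after the hyphen, where digits are expected.
    prefix, sep, suffix = candidate.partition("-")
    fixed = prefix + sep + suffix.translate(DIGIT_FIXES)
    if pattern.fullmatch(fixed):
        return fixed, "corrected"
    # Left for the user to rectify, as in step 406.
    return candidate, "needs_review"
```

For example, check_ocr_output("AB-12O4") returns
("AB-1204", "corrected"), while a string that cannot be matched is
flagged for manual correction.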

Having thus described the invention with the details and 
particularity required by the patent laws, what is claimed and 
desired protected by Letters Patent is set forth in the 
appended claims: 

What is claimed is:

1. A method for recognizing text in graphical drawings
comprising the steps of:

creating a binarized representation of a drawing to form
an electronic image of pixels;

discriminating text regions from lines in the image by
determining a distance between each black pixel and
nearest neighboring black pixels, grouping pixels into
pixel blocks having a same label based on a relationship
between the determined distance and a predetermined
distance, and comparing the blocks with a predetermined
format to identify text regions;

removing the lines from the image using a Hough Transform
to create text only regions; and

recognizing text in the text only regions.

2. The method as recited in claim 1 further comprising the 
step of optically scanning the drawing. 

3. The method as recited in claim 1 wherein the step of
creating a binarized representation includes the steps of
comparing pixel values to a grey-scale threshold value and
assigning the pixels to be either white or black.

4. The method as recited in claim 1 wherein the step of
discriminating text regions includes the steps of:

comparing each distance to the predetermined distance;
and

assigning labels to pixels within the predetermined
distance to form the pixel blocks having the same label.
5. The method as recited in claim 1 wherein the step of
removing the lines includes the steps of:

subdividing the drawing space into a grid defining grid
spaces;

counting black pixels associated with text and graphics
lines in each grid space;

comparing a count of black pixels to a predetermined
count for determining if the grid space contains a line;
and

removing lines by changing black pixels to white pixels.

6. The method as recited in claim 1 wherein the step of 
recognizing text in the text only regions includes using 
optical character recognition software. 

7. The method as recited in claim 1 further comprising the 
step of cleaning up the image by median filtering. 

8. The method as recited in claim 1 further comprising the 
step of creating anchorable information units from the text 
only regions for hypertext creation. 

9. A method for recognizing text in graphical drawings
comprising the steps of:

creating a binarized representation of a drawing to form
an electronic image of black pixels and white pixels;

determining a distance between each black pixel and
nearest neighboring black pixels;

comparing each distance to a predetermined distance;

assigning labels to pixels within the predetermined
distance to form pixel blocks having the same label; and

comparing pixel blocks to predetermined text formats for
identifying text regions;

removing the lines from the image using a Hough Transform
to create text only regions; and

recognizing text in the text only regions.

10. The method as recited in claim 9 further comprising 
the step of optically scanning the drawing. 

11. The method as recited in claim 9 wherein the step of 
creating a binarized representation includes the steps of 
comparing pixel values to a grey-scale threshold value and 
assigning the pixels to be either white or black. 

12. The method as recited in claim 9 wherein the step of
removing the lines includes the steps of:

subdividing the drawing space into a grid defining grid
spaces;

counting black pixels associated with text and graphics
lines in each grid space;

comparing a count of black pixels to a predetermined
count for determining if the grid space contains a line;
and

removing lines by changing black pixels to white pixels.

13. The method as recited in claim 9 wherein the step of 
recognizing text in the text only regions includes using 
optical character recognition software. 

14. The method as recited in claim 9 further comprising 
the step of cleaning up the image by median filtering. 

15. The method as recited in claim 9 further comprising 
the step of creating anchorable information units from the 
text only regions for hypertext creation. 

16. A method for recognizing text in graphical drawings
comprising the steps of:

creating a binarized representation of a drawing to form
an electronic image of black pixels and white pixels;

determining a distance between each black pixel and
nearest neighboring black pixels;

comparing each distance to a predetermined distance;

assigning labels to pixels within the predetermined
distance to form pixel blocks having the same label; and

comparing pixel blocks to predetermined text formats for
identifying text regions;

subdividing the drawing space into a grid defining grid
spaces;

counting black pixels associated with text and graphics
lines in each grid space;

comparing a count of black pixels to a predetermined
count for determining if the grid space contains a line;

removing lines by changing black pixels to white pixels to
create text only regions;

recognizing text in the text only regions; and

converting the text to hypertext for creating an anchorable
information unit.

17. The method as recited in claim 16 further comprising
the step of optically scanning the drawing.

18. The method as recited in claim 16 wherein the step of
creating a binarized representation includes the steps of
comparing pixel values to a grey-scale threshold value and
assigning the pixels to be either white or black.

19. The method as recited in claim 16 wherein the step of
recognizing text in the text only regions includes using
optical character recognition software.

20. The method as recited in claim 16 further comprising
the step of cleaning up the image by median filtering.

* * * * *